A clustering-based topic model using word networks and word embeddings

Abstract

Online social networking services like Twitter are frequently used for discussions on numerous topics of interest, which range from mainstream and popular topics (e.g., music and movies) to niche and specialized topics (e.g., politics). Due to the popularity of such services, it is a challenging task to automatically model and determine the numerous discussion topics given the large amount of tweets. Adding to this complexity is the need to identify these topics in the absence of prior knowledge about both the types and number of topics, while having the requirement of relevant technical expertise to tune the numerous parameters of the various models. To address this challenge, we develop the Clustering-based Topic Modelling (ClusTop) algorithm, which first constructs different types of word networks based on different types of n-gram co-occurrence and word embedding distances. Using these word networks, ClusTop is then able to automatically determine the discussion topics using community detection approaches. In contrast to traditional topic models, ClusTop does not require the tuning or setting of numerous parameters and instead uses community detection approaches to automatically determine the appropriate number of topics. The ClusTop algorithm is also able to capture the syntactic meaning in tweets via the use of bigrams, trigrams, other word combinations and word embedding techniques in constructing the word network graph, and utilizes edge weights based on word embedding. Using three Twitter datasets with labelled crises and events as topics, we show that ClusTop outperforms various baselines in terms of topic coherence, pointwise mutual information, precision, recall and F-score.
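The pipeline described in the abstract (build a word network from n-gram co-occurrence, then read topics off its communities) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_bigram_network` and `detect_topics` are hypothetical, only bigram co-occurrence is used, embedding-based edge weights are omitted, and a simple weighted label propagation stands in for the community detection step, since the abstract does not name a specific method.

```python
from collections import defaultdict


def build_bigram_network(tweets):
    """Return {word: {neighbour: weight}}, counting adjacent-word (bigram) co-occurrences."""
    graph = defaultdict(lambda: defaultdict(int))
    for tweet in tweets:
        tokens = tweet.lower().split()
        for left, right in zip(tokens, tokens[1:]):
            if left != right:  # skip self-loops such as "no no"
                graph[left][right] += 1
                graph[right][left] += 1
    return graph


def detect_topics(graph, max_iter=100):
    """Group words by weighted label propagation.

    Assumption: this is a stand-in community detector chosen for brevity,
    not the method used in the paper.
    """
    labels = {word: word for word in graph}  # every word starts in its own community
    for _ in range(max_iter):
        changed = False
        for word in sorted(graph):  # fixed order keeps the result deterministic
            scores = defaultdict(int)
            for neighbour, weight in graph[word].items():
                scores[labels[neighbour]] += weight
            best = max(scores.values())
            candidates = sorted(l for l, s in scores.items() if s == best)
            # Keep the current label on ties, otherwise take the smallest candidate.
            new_label = labels[word] if labels[word] in candidates else candidates[0]
            if new_label != labels[word]:
                labels[word] = new_label
                changed = True
        if not changed:
            break
    topics = defaultdict(set)
    for word, label in labels.items():
        topics[label].add(word)
    return sorted(topics.values(), key=lambda words: sorted(words))


# Toy input (invented for illustration, not taken from the paper's datasets).
tweets = [
    "earthquake hits city",
    "earthquake damage city",
    "city earthquake rescue",
    "concert tickets sale",
    "concert tickets soldout",
    "tickets concert tonight",
]
topics = detect_topics(build_bigram_network(tweets))
for words in topics:
    print(sorted(words))
```

Note that the number of topics is never passed in; it falls out of the graph structure, which is the property the abstract emphasizes over traditional topic models that need this parameter set in advance.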

Similar resources

A Correlated Topic Model Using Word Embeddings

Conventional correlated topic models are able to capture correlation structure among latent topics by replacing the Dirichlet prior with the logistic normal distribution. Word embeddings have been proven to be able to capture semantic regularities in language. Therefore, the semantic relatedness and correlations between words can be directly calculated in the word embedding space, for example, ...

Topic Sentiment Joint Model with Word Embeddings

The topic sentiment joint model is an extended model that aims to detect sentiments and topics simultaneously from online reviews. Most existing topic sentiment joint modeling algorithms infer the resulting distributions from the co-occurrence of words. But when the training corpus is short and small, the resulting distributions might not be very satisfying. In this pape...

Topic Modeling Using Distributed Word Embeddings

We propose a new algorithm for topic modeling, Vec2Topic, that identifies the main topics in a corpus using semantic information captured via high-dimensional distributed word embeddings. Our technique is unsupervised and generates a list of topics ranked with respect to importance. We find that it works better than existing topic modeling techniques such as Latent Dirichlet Allocation for iden...

Topic Modelling with Word Embeddings

This work aims at evaluating and comparing two different frameworks for the unsupervised topic modelling of the CompWHoB Corpus, namely our political-linguistic dataset. The first approach is represented by the application of Latent Dirichlet Allocation (henceforth LDA), with the evaluation of this model serving as the baseline for comparison. The second framework employs Word2Vec technique...

Nonparametric Spherical Topic Modeling with Word Embeddings

Traditional topic models do not account for semantic regularities in language. Recent distributional representations of words exhibit semantic consistency over directional metrics such as cosine similarity. However, neither categorical nor Gaussian observational distributions used in existing topic models are appropriate to leverage such correlations. In this paper, we propose to use the von Mi...

Journal

Journal title: Journal of Big Data

Year: 2022

ISSN: 2196-1115

DOI: https://doi.org/10.1186/s40537-022-00585-4